The aim of this analysis is to investigate diabetes prevalence over time and by country/region. The purpose is to identify countries and years with high diabetes prevalence.
The dataset ‘DIABETES evolution of diabetes over time’ is a global dataset of diabetes prevalence from 1980 to 2014. It contains 14,000 observations and 7 variables:

- “Country/Region/World”: e.g. “Turkey”, “Bangladesh”, “New Zealand”
- “ISO”: a three-letter country code (e.g. “AFG” for Afghanistan)
- “Sex”: a two-level categorical variable, “Men” or “Women”
- “Year”: ranges from 1980 to 2014
- “Age-standardised diabetes prevalence”: stored as a proportion (e.g. 0.04 corresponds to 4%)
- “Lower 95% uncertainty interval”
- “Upper 95% uncertainty interval”

Table 1 below shows the first six observations of the full dataset.
# Read in Data
data_full <- read_csv("Data/Diabetes_data.csv")
## Rows: 14000 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country/Region/World, ISO, Sex
## dbl (4): Year, Age-standardised diabetes prevalence, Lower 95% uncertainty i...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_full_head <- head(data_full)
kable(data_full_head,
caption = "Table 1: First Six Observations of the Full Diabetes Dataset",
digits = 2)
| Country/Region/World | ISO | Sex | Year | Age-standardised diabetes prevalence | Lower 95% uncertainty interval | Upper 95% uncertainty interval |
|---|---|---|---|---|---|---|
| Afghanistan | AFG | Men | 1980 | 0.04 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1981 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1982 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1983 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1984 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1985 | 0.05 | 0.02 | 0.09 |
The full dataset was reduced to 1000 observations by randomly sampling row numbers. The variable “ISO” was removed as it was not needed for the analysis. **Figure 1** below shows the code used to tidy the full dataset into the reduced dataset.
include_graphics("Image/code_screenshot.png")
Figure 1: Code Screenshot of Data Tidying
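Because the tidying code is only available as a screenshot, the steps described above might look roughly like the sketch below. It is not the author's exact code: the seed, the small stand-in data frame `data_full_demo` (with made-up values), and the use of `sample()` with `slice()` are assumptions. The renamed columns follow the names that appear in the reduced dataset.

```r
library(dplyr)

# Made-up stand-in for the full dataset so the sketch runs on its own;
# in the report, `data_full` comes from read_csv("Data/Diabetes_data.csv").
data_full_demo <- tibble(
  `Country/Region/World` = rep(c("Afghanistan", "Australia"), each = 70),
  ISO = rep(c("AFG", "AUS"), each = 70),
  Sex = rep(rep(c("Men", "Women"), each = 35), times = 2),
  Year = rep(1980:2014, times = 4),
  `Age-standardised diabetes prevalence` = seq(0.04, 0.12, length.out = 140),
  `Lower 95% uncertainty interval` = seq(0.02, 0.08, length.out = 140),
  `Upper 95% uncertainty interval` = seq(0.09, 0.18, length.out = 140)
)

set.seed(2023)  # assumed seed, used only to make the sample reproducible

# Randomly generate row numbers (the report keeps 1000 of 14,000 rows)
rows <- sample(nrow(data_full_demo), 100)

data_demo <- data_full_demo %>%
  slice(rows) %>%      # keep only the sampled rows
  select(-ISO) %>%     # drop the country code
  rename(
    diabetes_prevalence = `Age-standardised diabetes prevalence`,
    lower_95 = `Lower 95% uncertainty interval`,
    upper_95 = `Upper 95% uncertainty interval`
  )
```

Setting a seed before sampling means the reduced dataset, and every table built from it, is the same each time the report is knitted.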
The function str() is applied to the first 2 rows of the reduced data to show the type of each variable (numeric, character/factor, etc.). The assessment requires a maximum of 5 variables, but both “lower_95” and “upper_95” were kept because together they define the uncertainty interval.
head_data_2 <- head(data,2)
str(head_data_2)
## tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
## $ Country/Region/World: chr [1:2] "Russian Federation" "Czech Republic"
## $ Sex : chr [1:2] "Women" "Men"
## $ Year : num [1:2] 2009 2013
## $ diabetes_prevalence : num [1:2] 0.0779 0.0834
## $ lower_95 : num [1:2] 0.0441 0.0448
## $ upper_95 : num [1:2] 0.121 0.136
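Two summary statistics, the mean and the standard deviation, were calculated by “Year” for diabetes prevalence and for the upper and lower bounds of its uncertainty interval. Table 2 shows the last 10 rows of the results (2005 to 2014).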
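The code that builds `data_summary` is not shown in the report. A minimal sketch of how such a per-year summary could be computed with dplyr is given below. The stand-in data frame `data_demo` and its values are made up for illustration; the output column names are taken from Table 2.

```r
library(dplyr)

# Tiny made-up stand-in for the reduced dataset (illustrative values only)
data_demo <- tibble(
  Year = c(2013, 2013, 2014, 2014),
  diabetes_prevalence = c(0.08, 0.10, 0.09, 0.11),
  lower_95 = c(0.04, 0.06, 0.05, 0.07),
  upper_95 = c(0.12, 0.14, 0.13, 0.15)
)

# One row per year, with the mean and standard deviation of each measure
data_summary_demo <- data_demo %>%
  group_by(Year) %>%
  summarise(
    mean_diabetes = mean(diabetes_prevalence),
    sd_diabetes = sd(diabetes_prevalence),
    mean_upper95 = mean(upper_95),
    sd_upper95 = sd(upper_95),
    mean_lower95 = mean(lower_95),
    sd_lower95 = sd(lower_95)
  )
```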
# Show the last 10 years of the summary (2005 to 2014)
tail_data_summary <- tail(data_summary, 10)
kable(tail_data_summary,
      caption = "Table 2: Mean and Standard Deviation of Diabetes Prevalence by Year (Last 10 Rows)",
      digits = 3)
| Year | mean_diabetes | sd_diabetes | mean_upper95 | sd_upper95 | mean_lower95 | sd_lower95 |
|---|---|---|---|---|---|---|
| 2005 | 0.108 | 0.057 | 0.147 | 0.072 | 0.045 | 0.045 |
| 2006 | 0.093 | 0.056 | 0.131 | 0.071 | 0.043 | 0.043 |
| 2007 | 0.083 | 0.048 | 0.116 | 0.061 | 0.036 | 0.036 |
| 2008 | 0.101 | 0.046 | 0.141 | 0.057 | 0.035 | 0.035 |
| 2009 | 0.096 | 0.058 | 0.136 | 0.075 | 0.043 | 0.043 |
| 2010 | 0.102 | 0.071 | 0.144 | 0.090 | 0.053 | 0.053 |
| 2011 | 0.089 | 0.048 | 0.133 | 0.064 | 0.033 | 0.033 |
| 2012 | 0.120 | 0.064 | 0.176 | 0.083 | 0.046 | 0.046 |
| 2013 | 0.094 | 0.031 | 0.150 | 0.047 | 0.019 | 0.019 |
| 2014 | 0.093 | 0.046 | 0.153 | 0.066 | 0.029 | 0.029 |
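Table 2 does not show a clear trend in mean diabetes prevalence from 2005 to 2014: the yearly means fluctuate between 8.3% (2007) and 12.0% (2012). Over this period, 2012 had the highest mean diabetes prevalence at 12.0%, while 2010 had the highest standard deviation (0.071). Because the reduced dataset is a random sample of 1000 observations, part of this year-to-year variation is likely sampling noise.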
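Rather than reading the extremes off the table by eye, the years with the highest mean and highest standard deviation can be looked up programmatically. The sketch below assumes the two relevant columns of Table 2 are retyped into a small data frame (the name `table_2` is hypothetical); in the report, the same calls could be applied to `tail_data_summary` directly.

```r
library(dplyr)

# Values transcribed from the mean_diabetes and sd_diabetes columns of Table 2
table_2 <- tibble(
  Year = 2005:2014,
  mean_diabetes = c(0.108, 0.093, 0.083, 0.101, 0.096,
                    0.102, 0.089, 0.120, 0.094, 0.093),
  sd_diabetes = c(0.057, 0.056, 0.048, 0.046, 0.058,
                  0.071, 0.048, 0.064, 0.031, 0.046)
)

# Row with the largest yearly mean, and row with the largest yearly spread
highest_mean <- table_2 %>% slice_max(mean_diabetes, n = 1)
highest_sd <- table_2 %>% slice_max(sd_diabetes, n = 1)

highest_mean
highest_sd
```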
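A scatter plot of mean diabetes prevalence by year was created with the ggplot2 R package using geom_point(), with a smoothed trend line from geom_smooth() and error bars from geom_errorbar(). This is displayed in Figure 2: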
# Mean prevalence per year, with a loess trend line and +/- 1 SD error bars
Figure_2 <- ggplot(data = data_summary, aes(x = Year, y = mean_diabetes)) +
  geom_point(alpha = 0.7) +
  labs(title = "Figure 2: Mean Diabetes Prevalence Increases Over Time",
       caption = "`geom_smooth()` using method = 'loess' and formula = 'y ~ x'",
       subtitle = "Red Bars Represent Standard Deviation") +
  xlab("Year") +
  ylab("Mean Diabetes Prevalence") +
  theme_minimal() +
  geom_smooth() +
  geom_errorbar(aes(ymin = mean_diabetes - sd_diabetes,
                    ymax = mean_diabetes + sd_diabetes),
                colour = "red", alpha = 0.3)
# Convert the static plot to an interactive one
ggplotly(Figure_2)
# Keep only the Australian rows of the full dataset
Australia_summary <- data_full %>%
  filter(`Country/Region/World` == "Australia")
# Prevalence by year, coloured by sex, with a loess trend line per sex
Figure_3 <- ggplot(data = Australia_summary,
                   aes(x = Year, y = `Age-standardised diabetes prevalence`, col = Sex)) +
  geom_point(alpha = 0.8) +
  labs(title = "Figure 3: Men have Higher Risk of Diabetes",
       caption = "`geom_smooth()` using method = 'loess' and formula = 'y ~ x'",
       subtitle = "Diabetes Prevalence Has Increased Over Time") +
  xlab("Year") +
  ylab("Age-standardised Diabetes Prevalence") +
  theme_minimal() +
  geom_smooth()
Figure_3
Figure 3 shows a trend of increasing diabetes prevalence in Australia over time. Men have a noticeably higher prevalence than women. There is a steep increase from 1980 to 2000, followed by a plateau. The data end in 2014, so it is unknown whether the plateau is followed by a downward trend.
# Mean prevalence over all years, by country and sex, for five selected countries
Australia_table_summary <- data_full %>%
  filter(`Country/Region/World` %in% c("Australia", "Germany", "China", "South Africa", "United States of America")) %>%
  select(-ISO) %>%
  group_by(`Country/Region/World`, Sex) %>%
  summarise(`Mean diabetes prevalence` = mean(`Age-standardised diabetes prevalence`),
            .groups = "drop")
kable(Australia_table_summary,
      caption = "Table 3: Mean Diabetes Prevalence by Country and Sex",
      digits = 3)
| Country/Region/World | Sex | Mean diabetes prevalence |
|---|---|---|
| Australia | Men | 0.064 |
| Australia | Women | 0.047 |
| China | Men | 0.060 |
| China | Women | 0.061 |
| Germany | Men | 0.056 |
| Germany | Women | 0.040 |
| South Africa | Men | 0.069 |
| South Africa | Women | 0.097 |
| United States of America | Men | 0.065 |
| United States of America | Women | 0.054 |
Five countries were selected to compare mean diabetes prevalence, averaged over all years, by country and sex. Table 3 shows that in Australia, Germany and the United States of America, men have a higher mean diabetes prevalence than women. Mean diabetes prevalence for men and women in China is very similar, with women being 0.001 higher. Interestingly, women in South Africa have a noticeably higher mean diabetes prevalence than men (0.097 compared with 0.069).
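The comparison between men and women can be made explicit by putting the two sexes side by side and taking their difference. The sketch below is only an illustration: it retypes the values of Table 3 into a hypothetical data frame `table_3` with shortened column names, and assumes the tidyr package is available for `pivot_wider()`.

```r
library(dplyr)
library(tidyr)

# Values transcribed from Table 3
table_3 <- tibble(
  country = rep(c("Australia", "China", "Germany", "South Africa",
                  "United States of America"), each = 2),
  Sex = rep(c("Men", "Women"), times = 5),
  mean_prevalence = c(0.064, 0.047, 0.060, 0.061, 0.056,
                      0.040, 0.069, 0.097, 0.065, 0.054)
)

# One row per country, with a column for each sex and their difference
sex_gap <- table_3 %>%
  pivot_wider(names_from = Sex, values_from = mean_prevalence) %>%
  mutate(gap_men_minus_women = Men - Women)

sex_gap
```

A positive `gap_men_minus_women` means men have the higher mean prevalence, and a negative value means women do.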